fix(tui,mcp,acpx): production-ready coder→reviewer loop via Nexus IPC#239
Merged
windoliver merged 5 commits intomainfrom Apr 11, 2026
Merged
fix(tui,mcp,acpx): production-ready coder→reviewer loop via Nexus IPC#239windoliver merged 5 commits intomainfrom
windoliver merged 5 commits intomainfrom
Conversation
End-to-end fix for the full review-loop path: Grove TUI → acpx(claude) →
grove MCP stdio → Nexus contribution store → TopologyRouter handoff →
reviewer IPC push → grove_submit_review → reviewer→coder handoff.
Validated locally: both coder→reviewer and reviewer→coder handoffs
delivered in the TUI Handoffs panel, hello.txt loop completes via native
`mcp__grove__grove_cas_put` / `grove_submit_work` / `grove_submit_review`
tool calls — no curl fallback, no SQLite split-brain.
Root causes fixed:
1. acpx DEFAULT_AGENT hardcoded to "codex"
src/tui/main.ts + src/core/acpx-runtime.ts
Every role was spawned as codex regardless of the launch-preview CLI
mapping. Now AcpxRuntime.spawn() extracts the agent (claude/codex/
gemini/pi/openclaw) from AgentConfig.command and passes it through to
`acpx <agent>`. main.ts drops the hardcoded override.
2. Grove wrote .mcp.json but acpx reads .acpxrc.json
src/tui/spawn-manager.ts
acpx (even in v0.5.3) does not read claude-code's .mcp.json format;
it expects an .acpxrc.json with mcpServers as an array and env as
[{name,value}]. spawn-manager now writes both files on every spawn
and protects .acpxrc.json from agent mutation.
3. acpx 0.3.1 had a broken claude-agent-acp adapter
src/core/acpx-runtime.ts
0.3.x pulled @zed-industries/claude-agent-acp@^0.21.0 which failed
session/new with "Query closed before response received". 0.5.3
switched to @agentclientprotocol/claude-agent-acp@^0.25.0. Added a
startup version check that surfaces a clear upgrade hint.
4. grove MCP server threw on Nexus-mode startup
src/mcp/serve.ts
createLocalRuntime({parseContract: true}) tried to load the session
config from local SQLite, which doesn't have the record in Nexus
mode — throwing "Session has no stored config" and killing the MCP
server. serve.ts now skips the local parse when GROVE_NEXUS_URL is
set and reconstructs the contract from NexusSessionStore.getSession.
5. Nexus session record had no frozen contract
src/server/routes/sessions.ts + src/tui/provider-shared.ts +
src/mcp/serve.ts
toSessionResponse() stripped the `config` field before returning, so
the mirrored Nexus session only had `topology`. serve.ts now loads
the authoritative GroveContract from the session record, giving real
rate-limit enforcement in nexus mode instead of a minimal stub.
6. serve-http.ts was SQLite-only
src/mcp/serve-http.ts
The HTTP MCP transport ignored GROVE_NEXUS_URL entirely and wrote
every contribution to the local grove.db. Bypassed handoff routing
for any caller hitting :4015/mcp. Now mirrors serve.ts: Nexus stores
+ NexusHandoffStore + TopologyRouter when GROVE_NEXUS_URL is set.
Reads GROVE_SESSION_ID from .grove/current-session.json when the
env var is unavailable (serve-http.ts is spawned before sessions
are created).
7. TUI didn't write current-session.json
src/tui/screens/screen-manager.tsx
Added a best-effort write on setSessionScope so serve-http.ts can
pick up the session ID without restarting.
8. spawn-manager didn't pass GROVE_NEXUS_URL to agent MCP servers
src/tui/spawn-manager.ts
writeMcpConfig built mcpEnv from process.env but the TUI was often
launched without GROVE_NEXUS_URL in env (nexusUrl lives in
grove.json). Added a fallback that reads this.groveDir/grove.json
when the env var is missing. Without this, the spawned grove MCP
subprocess fell back to local SQLite and the TUI never saw
contributions in the Nexus feed.
9. nexus-lifecycle GROVE_NEXUS_URL priority
src/cli/nexus-lifecycle.ts
Env var was last in the candidateUrls list so a healthy-looking but
broken Nexus would always win. Moved to first position.
10. tmux-runtime wrong socket
src/core/tmux-runtime.ts
listSessions() and capture-pane hit the default tmux server; spawn
uses `-L grove`. They were talking to different daemons — sessions
created by grove were invisible to grove's own listSessions. Added
`-L grove` to both commands.
11. codex mcp registration blocked every spawn ~3s
src/tui/spawn-manager.ts
`codex mcp remove` + `codex mcp add` in writeMcpConfig were synchronous
and added ~3s per spawn, blocking the spawn chain on machines where
codex is installed. Moved to fire-and-forget since the claude path
reads .acpxrc.json directly and codex reads its config lazily.
Other fixes:
- src/tui/provider-shared.ts: ApiSessionResponse now includes `config`
- tests/presets/preset-e2e-nexus.test.ts: rpc() sends NEXUS_API_KEY
auth header when set, surfaces non-2xx bodies. Dropped broken
swarm-ops contribution assertion (preset enforces a has_relation
gate that a bare CLI contribute can't satisfy; seed data was removed
in 6da5494 without updating this expectation).
- src/tui/spawn-manager.test.ts: setDefaultTimeout(30_000) for spawn
tests that do real git worktree I/O.
- src/core/acpx-runtime.test.ts: regex matches the new session-name
format (grove-<role>-<counter>-<timestamp>), timeout bumped to 30s.
Test coverage: `bun test` → 4878 pass, 0 fail (was 4873 / 5 fail).
`bun x tsc --noEmit` clean. `bun x biome check` clean on changed files.
Requires: acpx >= 0.5.3 (`npm install -g acpx@latest`).
Ran 8 rounds of adversarial review against the previous commit's session bootstrap / contract loading / spawn ordering paths. Each round surfaced real correctness gaps; this commit lands the fixes for all of them. Hardening summary by file: src/mcp/serve.ts - Fail closed on unreachable Nexus at startup instead of silently downgrading to local stores (would otherwise split-brain writes with the TUI). - Retry Nexus session lookup with exponential backoff before declaring the record missing, to absorb the TUI → Nexus mirror race. - Accept topology-only session records (legacy / grove_create_session tool) with a loud WARN by DEFAULT; opt into strict `config`-only mode via GROVE_MCP_STRICT_CONTRACT=1. Prevents backcompat breakage while giving ops a knob to tighten enforcement. - Exit the process when a session ID is set but Nexus returns no record at all (vs a topology-only record); the old code path kept writing to orphan VFS paths. src/mcp/serve-http.ts - Lazily resolve session-scoped deps per incoming HTTP request instead of baking sessionId into stores at process boot. serve-http.ts is started before the interactive session exists, so the previous static scoping was always wrong for fresh clients. - Invalidate MCP sessions whose bound grove sessionId no longer matches the current one so a grove session switch doesn't leave long-lived HTTP clients writing under the prior scope. - Reject new mutating MCP initialization in Nexus bootstrap mode (no grove session yet) with 503 SESSION_NOT_READY. Bootstrap-scoped McpServers would capture unscoped stores for their entire lifetime. - Differentiate "file missing" (valid pre-session state) from "file unreadable/corrupt" via SessionStateReadError, and only invalidate live sessions after a SUCCESSFUL read. Atomic writes on the TUI side (tmp + rename) and a read-error-tolerant handler keep concurrent session switches from tearing down in-flight requests. - Update the current-session.json mtime cache only AFTER a successful parse; a torn-write read no longer poisons the cache. - Mirror the strict/weak contract policy from serve.ts. src/tui/spawn-manager.ts - Serialize codex MCP registration through a per-session promise chain on the SpawnManager instance. Parallel role spawns in the same session previously raced remove+add on the single global "grove" entry. - Use stable "grove" MCP name (not per-session) to avoid leaking stale entries and persisted secrets in ~/.codex/config.toml over time. Sweep any legacy `grove-*` entries on first spawn. - Replace execSync shell strings with spawnSync argv arrays in the codex mcp add/remove path so paths with spaces / shell metacharacters can't break the command. - Only cache successful registrations; a failed first attempt clears the cache so the next spawn retries. - Probe `codex --version` once and skip registration entirely when codex is not installed (claude path reads .acpxrc.json directly). - writeMcpConfig swallows codex-registration errors softly so claude roles aren't blocked, but spawn() re-calls ensureCodexMcpRegistered with the real error propagated when the role is specifically codex. src/tui/nexus-provider.ts - Expose `mode = "nexus" as const` as a public discriminator so callers can distinguish Nexus from local providers without peeking at protected fields. - Await the Nexus session mirror in createSession and retry a few times; on final failure, archive the just-created local/server session so repeated retries don't accumulate orphan active records. Throws so the TUI can surface the error instead of returning a broken session id. src/tui/screens/screen-manager.tsx - Refuse the synthetic-UUID fallback on Nexus session-creation failures (uses the new provider.mode discriminator); the stdio MCP server now fails closed when the session record is missing, so a fabricated ID would immediately crash every spawned agent. - Write .grove/current-session.json on BOTH new-session and resume paths so the HTTP MCP server's session scoping stays fresh across restarts and resumes. - Atomic writes via temp-file + rename so concurrent readers in serve-http.ts never observe truncated JSON during a session switch. src/cli/main.tui-dispatch.test.ts - Pre-existing flake: bun cold-start of main.ts exceeds the hardcoded 2s kill deadline on warm machines. Bumped inner timeout to 5s and the outer bun:test timeout to 15s so the race finishes cleanly. Test coverage: `bun test` → 4878 pass, 0 fail. `bun x tsc --noEmit` clean. `bun x biome check` clean on changed files.
Lands the safe subset of the sibling WIP that was in the worktree
alongside the Nexus/MCP fixes. Two themes:
1. Debug log hygiene
- src/tui/debug-log.ts: add `ENABLED` gate keyed off GROVE_DEBUG=1
so the unconditional `/tmp/grove-debug.log` writes in hot paths
become no-ops in normal operation. Before: every NexusHandoffStore
read/write and every NexusContributionStore put/list wrote a
timestamped line. After: only when GROVE_DEBUG=1 is set.
- src/nexus/nexus-handoff-store.ts: replace 6× inlined
`appendFileSync("/tmp/grove-debug.log", ...)` blocks with calls
to the shared `debugLog()` helper. Net -53 lines of boilerplate
and one consistent gate instead of six unguarded writes.
- src/nexus/nexus-contribution-store.ts: same cleanup; net -22
lines. Reuses the existing `manifestPath` local rather than
recomputing it inside the log.
2. contributeOperation handoff fan-out improvements
- src/core/operations/contribute.ts: the `createMany` fast path
already existed; the per-input fallback was serial and swallowed
every error silently. Switch the fallback to `Promise.allSettled`
so N downstream handoffs pay 1×RTT instead of N×RTT, and
surface per-failure reasons via `console.warn` so operators can
diagnose routing gaps instead of noticing handoffs missing from
the UI without any log trail. The best-effort semantics are
preserved: a failure does not abort the contribution that was
already committed.
- src/core/operations/contribute.test.ts: add a describe block
exercising the serial (fallback) path with an injected faulty
handoff store. Verifies (a) the contribution is still durably
persisted when every handoff create throws and (b) a single
`console.warn` fires for the failed batch.
3. Doc-only clarifications
- src/local/sqlite-store.ts: extend the `putWithCowrite` JSDoc to
spell out the duck-type capability contract used by
contributeOperation's `writeAtomic` selector.
- src/local/sqlite-handoff-store.ts: matching comment on
`insertSync` pointing at the same selector rule.
Explicitly NOT included in this commit (separate issue to follow):
`src/core/operations/bounty.ts` and `src/core/operations/bounty.test.ts`.
Three codex adversarial-review rounds flagged the new compensation
pattern there (settle-before-pay corruption + post-commit reservation
rollback + post-commit claim release). Those fixes need a saga /
outbox / reconciler design decision, not a mechanical change, so
they stay in the worktree until the design is resolved.
Test coverage: `bun test` (with NEXUS env) → 4878 pass, 0 fail.
`bun x tsc --noEmit` clean. `bun x biome check` clean on changed files.
windoliver
added a commit
that referenced
this pull request
Apr 11, 2026
…ED ON #240) **DO NOT MERGE** — this is a draft PR tracking in-progress work on bounty operation compensation. Three HIGH/CRITICAL correctness bugs have been flagged by adversarial review across three independent rounds; see #240 for the findings, design options, and recommended next steps. ## What this adds - `createBountyOperation`: try/catch that voids the credit reservation when `bountyStore.createBounty()` throws, so escrow is returned on creation failure. - `claimBountyOperation`: pre-flight check that only `open` bounties can be claimed, plus try/catch that releases the claim record when `bountyStore.claimBounty()` throws so other agents aren't blocked until the lease expires. - `settleBountyOperation`: reordered to persist `completed → settled` state transitions BEFORE calling `creditsService.capture()`, with a comment explaining the intent (make the failure recoverable by retry). - Test coverage for each compensation path using mock stores. - Extract `MISSING_BOUNTY_STORE` constant for the repeated error message. ## Known bugs (see #240 for details) 1. **[CRITICAL]** Settlement now commits terminal `settled` state before `capture()` runs. A capture failure leaves the bounty paid according to the state machine but with no funds actually transferred, and there is no reconciler in the codebase to repair it. src/core/operations/bounty.ts:319-333 2. **[HIGH]** The `createBounty` catch voids the reservation on any error, including post-commit failures from the NexusBountyStore two-step write (bounty doc first, then status index). A late failure leaks an "open" bounty with no escrow backing it. src/core/operations/bounty.ts:156-176 3. **[HIGH]** Same shape on `claimBounty`: the compensation releases the claim even when the bounty state transition actually persisted, leaving the bounty stuck in `claimed` pointing at a released claim. src/core/operations/bounty.ts:258-269 ## Design options for the fix See #240 — three paths, roughly: - **(a)** Full saga: `pending_settlement` state + reconciler + CAS confirmation on post-commit failures. ~200-500 lines, 1-2 days. - **(b)** CAS confirmation only: re-read before compensating so we don't rollback committed state. Doesn't fix the settle-before-pay race. ~100-200 lines. - **(c)** Revert the compensation refactor, keep only the cosmetic changes (constant extract + pre-flight claim check). Relies on reservation/lease expiry for recovery. ~50-100 lines. The issue recommends starting with (c) to stop the bleed, then doing (a) as a proper saga-log project with its own RFC. ## Why open this as a draft PR now This code existed in an uncommitted worktree alongside the `fix(tui,mcp,acpx)` work in PR #239. Rather than leave it in a limbo state where it can be accidentally restaged on the next push, this branch makes it visible, reviewable, and explicitly linked to the design issue it depends on. Do not merge until #240 has a resolved design and the code matches.
5 tasks
CI caught two lint errors that `bun x biome check` didn't surface when I ran it against only the changed files in the previous commits. Running the full `bun run check` the way CI does reproduced both. - src/core/acpx-runtime.test.ts: previous auto-fix left the `test()` call with the 30s timeout arg on a broken trailing-comment line that biome's formatter rejected on the second pass. Rewrote to the canonical 3-arg form with a single blank line between the arrow body and the timeout. - src/tui/spawn-manager.test.ts: the `setDefaultTimeout(30_000)` call I added in a prior round split the import block in two, which biome's `assist/source/organizeImports` rule flagged. Added a blank line between the call and the subsequent type imports so the import group is contiguous. No behavior change. `bun run check` is clean now (16 pre-existing warnings unrelated to my files). Full test suite still 4870 pass, 0 fail with NEXUS env set.
- contribute.ts: wrap parallel handoff create() in Promise.resolve().then() so a synchronous throw becomes a rejected promise instead of escaping .map() and bypassing allSettled. Without this, a sync throw from a HandoffStore extension would surface as an operation error after the contribution was already committed, releasing the idempotency slot and allowing duplicate contributions on retry. - contribute.test.ts: regression test with a HandoffStore whose create() throws synchronously and no createMany (forces the fallback path). - spawn-manager.ts: forward GROVE_DEBUG to spawned MCP agents via both .mcp.json/.acpxrc.json env and the codex mcp registration. debugLog() in NexusContributionStore/NexusHandoffStore reads the env var at module load inside the child grove-mcp process, which does not inherit env from the parent TUI shell — so without this passthrough turning on GROVE_DEBUG=1 in the parent never re-enabled agent-side traces.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
End-to-end fix for the full review-loop path: Grove TUI → acpx(claude) → grove MCP stdio → Nexus contribution store → TopologyRouter handoff → reviewer IPC push →
grove_submit_review→ reviewer→coder handoff.Validated locally: both
coder→reviewerandreviewer→coderhandoffs delivered in the TUI Handoffs panel, the hello.txt loop completes via nativemcp__grove__grove_cas_put/grove_submit_work/grove_submit_reviewtool calls — no curl fallback, no SQLite split-brain.Root causes fixed
DEFAULT_AGENThardcoded to"codex"— every role spawned as codex regardless of the launch-preview CLI mapping.AcpxRuntime.spawn()now derives the agent fromAgentConfig.command..mcp.jsonbut acpx reads.acpxrc.json— totally incompatible schemas.spawn-managernow writes both files per spawn.0.3.1had a broken claude-agent-acp adapter —session/newfailed with "Query closed before response received". Requires>= 0.5.3now; startup check surfaces an upgrade hint.createLocalRuntime({parseContract: true})hit "Session has no stored config" because the session record lives in Nexus VFS, not local SQLite.serve.tsnow skips the local parse in nexus mode and loads the contract fromNexusSessionStore.getSession.toSessionResponse()stripped theconfigfield. Response andmapApiSessionnow preserve it so the mirrored Nexus record carries the fullGroveContractfor real enforcement.serve-http.tswas SQLite-only — the HTTP MCP transport ignoredGROVE_NEXUS_URLand bypassed handoff routing. Now mirrorsserve.ts: Nexus stores +NexusHandoffStore+TopologyRouterwhenGROVE_NEXUS_URLis set. ReadsGROVE_SESSION_IDfrom.grove/current-session.json(TUI writes this file onsetSessionScope).spawn-managerdidn't passGROVE_NEXUS_URLto agent MCP servers — when the TUI was launched without the env var (common — nexusUrl lives ingrove.json), the spawned grove MCP subprocess fell back to local SQLite and the TUI never saw contributions in the Nexus feed. Added agrove.jsonfallback lookup.nexus-lifecycleGROVE_NEXUS_URLpriority — env var was last in thecandidateUrlslist so a healthy-looking but broken Nexus would always win. Moved to first position.tmux-runtimewrong socket —listSessionsandcapture-panehit the default tmux server whilespawnuses-L grove. They were talking to different daemons; grove's own sessions were invisible to itslistSessions.codex mcpregistration blocked every spawn ~3s —codex mcp remove+codex mcp addwere synchronous inwriteMcpConfig. Moved to fire-and-forget (claude reads.acpxrc.jsondirectly; codex reads its config lazily).Other fixes
ApiSessionResponsenow carriesconfigthroughmapApiSession.tests/presets/preset-e2e-nexus.test.ts—rpc()sendsNEXUS_API_KEYauth header when set and surfaces non-2xx bodies. Dropped the broken swarm-ops contribution assertion (the preset enforces ahas_relationgate that a bare CLI contribute can't satisfy; seed data was removed in 6da5494 without updating this expectation).src/tui/spawn-manager.test.ts—setDefaultTimeout(30_000)for tests that do realgit worktreeI/O.src/core/acpx-runtime.test.ts— regex now matches the new session-name format (grove-<role>-<counter>-<timestamp>), timeout bumped to 30s for coldnpxfetches.Validation
bun x tsc --noEmit— clean.bun x biome check— clean on changed files (4 pre-existing warnings untouched).bun test— 4878 pass, 0 fail (was 4873 pass / 5 fail before).grove up→ new session → review-loop → "write hello world in hello.txt" → claude coder callsmcp__grove__grove_cas_put+grove_submit_work→routedTo:["reviewer"]+handoffIds→ IPC push → claude reviewer callsgrove_submit_review+grove_done→ TUI Handoffs panel shows 2 total, 0 pending, both delivered.Requires
acpx
>= 0.5.3—npm install -g acpx@latest. The runtime throws a clear upgrade hint if older.Test plan
grove upon a worktree withgrove.jsonthat hasnexusUrl— verify the feed shows a coder contribution and Handoffs panel shows bothcoder→reviewerandreviewer→coderdeliveries.acpx --version< 0.5.3 — verify the runtime surfaces the upgrade hint and refuses to spawn.NEXUS_URL=http://... NEXUS_API_KEY=... bun test tests/presets/preset-e2e-nexus.test.ts— all 14 pass.bun teston main — 4878 pass / 0 fail.